Gapmindezr world is sweedish website creted to fight misconception by providing and collecting releable statistics helping to use facts to understand the world here. We have choose in this report to work with:
-Total oil production. :The total amouint of crude oil (tonne oil equivalent)
-total oil consumption :he total amount of crude oil consumption(tonne oil equivalent)
-Oil consumption per person (tonne per year per person)
-Oil poduction per person (tonne per person)
-Coutries with region sub regions table here
-Income per person (GDP per capita): gross domestic per person adjusted for difference in purshasing power(in internatoional dollars)
In this report we will work to anlyse data provided to answer these question:
-What is the total oil produced and consumed in the world since the 60s, and what is the total share of each regon?Is there a relation betwwen oil production, consumption per capita and GDP?
We need first to install Plotly libary to get beautiful Filled Area Plots that we will use in this rapport link: here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
# Load your data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
df_opc=pd.read_csv('oil_production_per_person.csv')
df_occ=pd.read_csv('oil_consumption_per_cap.csv')
df_op=pd.read_csv('oil_production_total.csv')
df_oc=pd.read_csv('oil_consumption_total.csv')
df_r=pd.read_csv('iso_region.csv')
df_gdp=pd.read_csv('income_per_person.csv')
We checked all csv files in Excel to make sure names of countries used in Gapminders datasets and names of countries in regions table from Data Hub match. There is some mismatch like the "United state3 and "Russia", that we fix in Excel Then we use head() to sj=ow the first 5 raws in Jupyter
df_op.head()
| country | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Angola | 655k | 631k | 537k | 750k | 2.46M | 5.07M | 5.72M | 7.06M | 8.18M | ... | 88.9M | 82M | 85.3M | 85.2M | 83.3M | 88.2M | 85.8M | 81.6M | 74.1M | 69.1M |
| 1 | United Arab Emirates | 13.7M | 17.4M | 18.5M | 24.1M | 30.4M | 37.8M | 51.3M | 58.4M | 74.2M | ... | 135M | 150M | 156M | 163M | 163M | 176M | 182M | 176M | 177M | 180M |
| 2 | Argentina | 13.8M | 14.6M | 16M | 17.5M | 18.1M | 20M | 21.6M | 22.2M | 21.6M | ... | 33.3M | 31.2M | 30.8M | 30.2M | 29.8M | 30M | 28.6M | 27.2M | 27.5M | 28.8M |
| 3 | Australia | 346k | 445k | 1.04M | 1.93M | 2.23M | 8.7M | 15.3M | 16.3M | 20.4M | ... | 24.5M | 21.5M | 21.4M | 17.8M | 19.1M | 17M | 15.6M | 14.9M | 15.2M | 20.6M |
| 4 | Azerbaijan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 51.3M | 46.1M | 43.7M | 43.8M | 42.5M | 42M | 41.4M | 39.1M | 39.2M | 38.1M |
5 rows × 56 columns
df_oc.head()
| country | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United Arab Emirates | 69.6k | 74.3k | 79.8k | 89.1k | 99k | 115k | 136k | 179k | 280k | ... | 31.3M | 33.8M | 35.2M | 39M | 39.3M | 42.2M | 44.9M | 44.2M | 46M | 44.6M |
| 1 | Argentina | 22M | 22.8M | 23.3M | 23.8M | 24.7M | 22.1M | 23.6M | 23.6M | 23.7M | ... | 27.5M | 27.4M | 28.6M | 30.7M | 30.1M | 30.9M | 30.2M | 29.4M | 28.1M | 27.6M |
| 2 | Australia | 15.1M | 18M | 19.7M | 21.5M | 22.1M | 24.4M | 25.7M | 26.1M | 27.9M | ... | 41.6M | 44.4M | 46.1M | 46.9M | 47.4M | 45.6M | 46M | 48.1M | 49.5M | 49.1M |
| 3 | Austria | 5.56M | 6.12M | 6.51M | 7.54M | 8.28M | 9.1M | 10.2M | 11M | 11.9M | ... | 12.7M | 11.9M | 11.9M | 12.2M | 11.7M | 11.7M | 12.1M | 12.3M | 12.4M | 12.7M |
| 4 | Azerbaijan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.26M | 3.99M | 4.17M | 4.54M | 4.45M | 4.54M | 4.52M | 4.56M | 4.78M | 4.93M |
5 rows × 56 columns
#Check number of lignes and columns
df_op.shape, df_oc.shape
((50, 56), (80, 56))
Both oil production and consupmtion Begin with datas from 1965 but consumption tablke have more lignes , this is normal because more countries consumes oil than producers and both contains the abriviation K and M resoectively for thounsands and millions, we need to convert numerical number. There is also some NaN values there are mainly country recently creted like coutries of the former URSS. We need fill these value with zeros. and to make joining region data easy and creating two tables, we need to convert to invert line and column but we will deal with that later.
#check for duplication
df_op.duplicated()
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False 30 False 31 False 32 False 33 False 34 False 35 False 36 False 37 False 38 False 39 False 40 False 41 False 42 False 43 False 44 False 45 False 46 False 47 False 48 False 49 False dtype: bool
df_oc.duplicated()
0 False
1 False
2 False
3 False
4 False
...
75 False
76 False
77 False
78 False
79 False
Length: 80, dtype: bool
There is no duplicated data in both oil tables
#we check the oil production per capita
df_opc.head()
| country | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Angola | 0.1140 | 0.1090 | 0.0930 | 0.130 | 0.424 | 0.860 | 0.947 | 1.130 | 1.260 | ... | 3.810 | 3.390 | 3.400 | 3.280 | 3.090 | 3.160 | 2.970 | 2.740 | 2.400 | 2.170 |
| 1 | United Arab Emirates | 91.2000 | 109.0000 | 109.0000 | 132.000 | 150.000 | 161.000 | 185.000 | 176.000 | 188.000 | ... | 15.800 | 16.800 | 17.100 | 17.700 | 17.700 | 19.000 | 19.500 | 18.600 | 18.300 | 18.400 |
| 2 | Argentina | 0.6210 | 0.6510 | 0.6990 | 0.755 | 0.770 | 0.838 | 0.889 | 0.901 | 0.862 | ... | 0.814 | 0.755 | 0.739 | 0.715 | 0.699 | 0.698 | 0.658 | 0.620 | 0.621 | 0.644 |
| 3 | Australia | 0.0306 | 0.0384 | 0.0873 | 0.158 | 0.178 | 0.680 | 1.170 | 1.230 | 1.520 | ... | 1.110 | 0.955 | 0.934 | 0.766 | 0.809 | 0.710 | 0.643 | 0.608 | 0.609 | 0.818 |
| 4 | Azerbaijan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.680 | 5.040 | 4.720 | 4.670 | 4.470 | 4.370 | 4.260 | 3.970 | 3.940 | 3.790 |
5 rows × 56 columns
#we check the oil consumption per capita
df_occ.head()
| country | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United Arab Emirates | 0.464 | 0.464 | 0.47 | 0.488 | 0.488 | 0.491 | 0.490 | 0.541 | 0.710 | ... | 3.660 | 3.780 | 3.850 | 4.240 | 4.260 | 4.550 | 4.800 | 4.660 | 4.780 | 4.570 |
| 1 | Argentina | 0.995 | 1.010 | 1.02 | 1.030 | 1.050 | 0.926 | 0.973 | 0.959 | 0.945 | ... | 0.672 | 0.664 | 0.684 | 0.727 | 0.706 | 0.718 | 0.694 | 0.670 | 0.632 | 0.616 |
| 2 | Australia | 1.340 | 1.550 | 1.66 | 1.760 | 1.760 | 1.910 | 1.970 | 1.970 | 2.070 | ... | 1.880 | 1.970 | 2.010 | 2.020 | 2.010 | 1.910 | 1.890 | 1.960 | 1.990 | 1.950 |
| 3 | Austria | 0.761 | 0.832 | 0.88 | 1.010 | 1.110 | 1.210 | 1.350 | 1.450 | 1.560 | ... | 1.510 | 1.400 | 1.390 | 1.430 | 1.360 | 1.340 | 1.380 | 1.390 | 1.400 | 1.420 |
| 4 | Azerbaijan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.361 | 0.436 | 0.450 | 0.484 | 0.468 | 0.471 | 0.465 | 0.463 | 0.481 | 0.490 |
5 rows × 56 columns
The oil production and consumption per caita table containt also Nan values and K and M coefficient. we will only keep data between 2000 and 2019, we limit our studu to 20 years to delete data from former countries that do not exist anymore.
#check region table
df_r.head()
| name | alpha-2 | alpha-3 | country-code | iso_3166-2 | region | sub-region | intermediate-region | region-code | sub-region-code | intermediate-region-code | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 4 | ISO 3166-2:AF | Asia | Southern Asia | NaN | 142.0 | 34.0 | NaN |
| 1 | Åland Islands | AX | ALA | 248 | ISO 3166-2:AX | Europe | Northern Europe | NaN | 150.0 | 154.0 | NaN |
| 2 | Albania | AL | ALB | 8 | ISO 3166-2:AL | Europe | Southern Europe | NaN | 150.0 | 39.0 | NaN |
| 3 | Algeria | DZ | DZA | 12 | ISO 3166-2:DZ | Africa | Northern Africa | NaN | 2.0 | 15.0 | NaN |
| 4 | American Samoa | AS | ASM | 16 | ISO 3166-2:AS | Oceania | Polynesia | NaN | 9.0 | 61.0 | NaN |
Wel will need to rename "name" colum to "country" in order to join to other table
#checking gdp per capita
df_gdp.head()
| country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | ... | 2690 | 2750 | 2810 | 2870 | 2930 | 2990 | 3060 | 3120 | 3190 | 3260 |
| 1 | Angola | 700 | 702 | 705 | 709 | 711 | 714 | 718 | 721 | 725 | ... | 8000 | 8170 | 8350 | 8530 | 8710 | 8900 | 9090 | 9280 | 9480 | 9690 |
| 2 | Albania | 755 | 755 | 755 | 755 | 755 | 756 | 756 | 756 | 756 | ... | 25.1k | 25.6k | 26.2k | 26.7k | 27.3k | 27.9k | 28.5k | 29.1k | 29.7k | 30.4k |
| 3 | Andorra | 1360 | 1360 | 1360 | 1360 | 1370 | 1370 | 1370 | 1370 | 1380 | ... | 68.9k | 70.4k | 71.9k | 73.4k | 75k | 76.6k | 78.3k | 80k | 81.7k | 83.4k |
| 4 | United Arab Emirates | 1130 | 1130 | 1140 | 1140 | 1150 | 1150 | 1160 | 1160 | 1160 | ... | 101k | 103k | 105k | 107k | 110k | 112k | 114k | 117k | 119k | 122k |
5 rows × 252 columns
Data for GDP begin in the year 1800 and ends in 2050, for our work we will only need dat from 2000 to 2019 so we need first to keep only data from 1800 to 2019 and then remove data from 1800 to 1999
#keep only dat bewwen 1880 and 2019
df_gdp_20 = df_gdp.loc[:,'country':'2019']
#delete dat betwween 1800 and 1999
df_gdp_20=df_gdp_20.drop(df_gdp_20.loc[:, '1800':'1999'].columns, axis=1)
#GDP per caoita for la last 20 years df_gdp_20
df_gdp_20.head()
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 584 | 569 | 1190 | 1240 | 1200 | 1290 | 1320 | 1460 | 1480 | ... | 1960 | 1910 | 2080 | 2120 | 2100 | 2070 | 2060 | 2060 | 2030 | 2070 |
| 1 | Angola | 2340 | 2510 | 2930 | 3150 | 3560 | 4310 | 5610 | 6960 | 7850 | ... | 7690 | 7680 | 8040 | 8140 | 8240 | 8040 | 7570 | 7310 | 6930 | 6670 |
| 2 | Albania | 5890 | 6440 | 6750 | 7150 | 7580 | 8040 | 8570 | 9150 | 9910 | ... | 10.7k | 11.1k | 11.2k | 11.4k | 11.6k | 11.9k | 12.3k | 12.8k | 13.3k | 13.7k |
| 3 | Andorra | 35.8k | 36k | 36.1k | 39.1k | 41.1k | 45k | 48.4k | 49.2k | 46.9k | ... | 44.1k | 47.5k | 47.5k | 49.5k | 50.9k | 52.7k | 54.5k | 56.3k | 58.3k | 58.4k |
| 4 | United Arab Emirates | 71.6k | 72.2k | 73.5k | 78.6k | 82.5k | 80.8k | 86.4k | 76.6k | 68.8k | ... | 54.9k | 56.1k | 57.4k | 59.9k | 62.5k | 65.3k | 66.5k | 67.2k | 67k | 68.3k |
5 rows × 21 columns
#Renaming column "name" to "country" replacing "sub-region' by'sub_region"
df_r.rename(columns={'name':'country','sub-region':'sub_region'}, inplace='True')
df_r.head()
| country | alpha-2 | alpha-3 | country-code | iso_3166-2 | region | sub_region | intermediate-region | region-code | sub-region-code | intermediate-region-code | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 4 | ISO 3166-2:AF | Asia | Southern Asia | NaN | 142.0 | 34.0 | NaN |
| 1 | Åland Islands | AX | ALA | 248 | ISO 3166-2:AX | Europe | Northern Europe | NaN | 150.0 | 154.0 | NaN |
| 2 | Albania | AL | ALB | 8 | ISO 3166-2:AL | Europe | Southern Europe | NaN | 150.0 | 39.0 | NaN |
| 3 | Algeria | DZ | DZA | 12 | ISO 3166-2:DZ | Africa | Northern Africa | NaN | 2.0 | 15.0 | NaN |
| 4 | American Samoa | AS | ASM | 16 | ISO 3166-2:AS | Oceania | Polynesia | NaN | 9.0 | 61.0 | NaN |
We need now to reduce df_opc and df_occ to only year of study meaning 2000 to 2019
#drop column from 1965 to 1999 by using the drop function
df_opc_20=df_opc.drop(df_opc.loc[:, '1965':'1999'].columns, axis=1)
#checking oil prodction per capita table
df_opc_20.head()
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Angola | 2.25 | 2.16 | 2.55 | 2.36 | 2.91 | 3.21 | 3.41 | 3.880 | 4.250 | ... | 3.810 | 3.390 | 3.400 | 3.280 | 3.090 | 3.160 | 2.970 | 2.740 | 2.400 | 2.170 |
| 1 | United Arab Emirates | 38.80 | 35.90 | 31.30 | 34.00 | 32.30 | 29.80 | 27.50 | 23.300 | 20.500 | ... | 15.800 | 16.800 | 17.100 | 17.700 | 17.700 | 19.000 | 19.500 | 18.600 | 18.300 | 18.400 |
| 2 | Argentina | 1.12 | 1.18 | 1.15 | 1.13 | 1.07 | 1.01 | 1.00 | 0.963 | 0.943 | ... | 0.814 | 0.755 | 0.739 | 0.715 | 0.699 | 0.698 | 0.658 | 0.620 | 0.621 | 0.644 |
| 3 | Australia | 1.95 | 1.79 | 1.76 | 1.50 | 1.30 | 1.26 | 1.15 | 1.170 | 1.130 | ... | 1.110 | 0.955 | 0.934 | 0.766 | 0.809 | 0.710 | 0.643 | 0.608 | 0.609 | 0.818 |
| 4 | Azerbaijan | 1.73 | 1.83 | 1.85 | 1.84 | 1.83 | 2.60 | 3.74 | 4.970 | 5.140 | ... | 5.680 | 5.040 | 4.720 | 4.670 | 4.470 | 4.370 | 4.260 | 3.970 | 3.940 | 3.790 |
5 rows × 21 columns
#drop column from 1965 to 1999 by using the drop function
df_occ_20=df_occ.drop(df_occ.loc[:, '1965':'1999'].columns, axis=1)
#checking oil consumption per capita table
df_occ_20.head()
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United Arab Emirates | 6.270 | 5.870 | 5.760 | 5.970 | 5.890 | 5.440 | 5.070 | 4.650 | 4.250 | ... | 3.660 | 3.780 | 3.850 | 4.240 | 4.260 | 4.550 | 4.800 | 4.660 | 4.780 | 4.570 |
| 1 | Argentina | 0.551 | 0.524 | 0.479 | 0.488 | 0.507 | 0.534 | 0.554 | 0.609 | 0.621 | ... | 0.672 | 0.664 | 0.684 | 0.727 | 0.706 | 0.718 | 0.694 | 0.670 | 0.632 | 0.616 |
| 2 | Australia | 2.030 | 2.020 | 1.990 | 1.970 | 2.000 | 1.950 | 2.080 | 2.010 | 1.980 | ... | 1.880 | 1.970 | 2.010 | 2.020 | 2.010 | 1.910 | 1.890 | 1.960 | 1.990 | 1.950 |
| 3 | Austria | 1.470 | 1.590 | 1.620 | 1.740 | 1.680 | 1.670 | 1.660 | 1.560 | 1.540 | ... | 1.510 | 1.400 | 1.390 | 1.430 | 1.360 | 1.340 | 1.380 | 1.390 | 1.400 | 1.420 |
| 4 | Azerbaijan | 0.779 | 0.488 | 0.443 | 0.511 | 0.545 | 0.630 | 0.557 | 0.519 | 0.407 | ... | 0.361 | 0.436 | 0.450 | 0.484 | 0.468 | 0.471 | 0.465 | 0.463 | 0.481 | 0.490 |
5 rows × 21 columns
we need to change the structure of oil production, oil consumption and income per capita data frame melt function. Melting data is the process of turning columns of our data into rows of data.
#inverting oil prooduction table (_m for melt)
df_op_m=pd.melt(df_op,['country'],var_name='year',value_name='oil_production').sort_values(by=['country','year']).reset_index(drop='True')
#check oil production table
df_op_m.head()
| country | year | oil_production | |
|---|---|---|---|
| 0 | Algeria | 1965 | 26.1M |
| 1 | Algeria | 1966 | 33.6M |
| 2 | Algeria | 1967 | 38.6M |
| 3 | Algeria | 1968 | 42.4M |
| 4 | Algeria | 1969 | 44.4M |
#inverting oil consumption table (_m for melt)
df_oc_m=pd.melt(df_oc,['country'],var_name='year',value_name='oil_consumption').sort_values(by=['country','year']).reset_index(drop='True')
#checking consumption table
df_oc_m.head()
| country | year | oil_consumption | |
|---|---|---|---|
| 0 | Algeria | 1965 | 1.29M |
| 1 | Algeria | 1966 | 1.69M |
| 2 | Algeria | 1967 | 1.58M |
| 3 | Algeria | 1968 | 1.68M |
| 4 | Algeria | 1969 | 1.77M |
#inverting GDP table
df_gdp_m=pd.melt(df_gdp_20,['country'],var_name='year',value_name='gdp_capita').sort_values(by=['country','year']).reset_index(drop='True')
df_gdp_m.head()
| country | year | gdp_capita | |
|---|---|---|---|
| 0 | Afghanistan | 2000 | 584 |
| 1 | Afghanistan | 2001 | 569 |
| 2 | Afghanistan | 2002 | 1190 |
| 3 | Afghanistan | 2003 | 1240 |
| 4 | Afghanistan | 2004 | 1200 |
#inverting oil production per capita table (_m for melt)
df_opc_m=pd.melt(df_opc_20,['country'],var_name='year',value_name='oil_prod_capita').sort_values(by=['country','year']).reset_index(drop='True')
#check tablle
df_opc_m.head()
| country | year | oil_prod_capita | |
|---|---|---|---|
| 0 | Algeria | 2000 | 2.15 |
| 1 | Algeria | 2001 | 2.09 |
| 2 | Algeria | 2002 | 2.23 |
| 3 | Algeria | 2003 | 2.45 |
| 4 | Algeria | 2004 | 2.56 |
#inverting oil consumption per capita table (_m for melt)
df_occ_m=pd.melt(df_occ_20,['country'],var_name='year',value_name='oil_cons_capita').sort_values(by=['country','year']).reset_index(drop='True')
df_occ_m.head()
| country | year | oil_cons_capita | |
|---|---|---|---|
| 0 | Algeria | 2000 | 0.273 |
| 1 | Algeria | 2001 | 0.279 |
| 2 | Algeria | 2002 | 0.306 |
| 3 | Algeria | 2003 | 0.314 |
| 4 | Algeria | 2004 | 0.324 |
df_op_m.shape
(2750, 3)
df_oc_m.shape
(4400, 3)
df_gdp_m.shape
(3900, 3)
df_op_m.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2750 entries, 0 to 2749 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 2750 non-null object 1 year 2750 non-null object 2 oil_production 2399 non-null object dtypes: object(3) memory usage: 64.6+ KB
df_oc_m.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4400 entries, 0 to 4399 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 4400 non-null object 1 year 4400 non-null object 2 oil_consumption 4084 non-null object dtypes: object(3) memory usage: 103.2+ KB
df_gdp_m.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3900 entries, 0 to 3899 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 3900 non-null object 1 year 3900 non-null object 2 gdp_capita 3900 non-null object dtypes: object(3) memory usage: 91.5+ KB
We will crete three table, a table containg oil production , oil con sumption abnd we join to them region, and another table contraing oil production and consuption per capita and join region
We need first to fill remaing null datas overall oil production an dconsumption table with zeros because it lake sense that means a country has not discovered oil yet or it ressource dry out.
# Filling Nan values with 0
df_op_m['oil_production']=df_op_m['oil_production'].fillna(0)
#Checking non null value
df_op_m.isnull().sum()
country 0 year 0 oil_production 0 dtype: int64
we will do Next relace K and M by its values we will do that using reolace function
#regex=True is needed if we want the key string to be replaced by value string (e.g. K replaced by "*1e3" in this case)
#1e3 is the scientific notation of 1000.
df_op_m['oil_production']=df_op_m['oil_production'].replace({'k':'*1e3','M':'*1e6'}, regex=True).map(pd.eval).astype(np.int64)
#Check type
df_op_m.dtypes
country object year object oil_production int64 dtype: object
#we had elminate M and K
df_op_m.head()
| country | year | oil_production | |
|---|---|---|---|
| 0 | Algeria | 1965 | 26100000 |
| 1 | Algeria | 1966 | 33600000 |
| 2 | Algeria | 1967 | 38600000 |
| 3 | Algeria | 1968 | 42400000 |
| 4 | Algeria | 1969 | 44400000 |
#we do the same for oil consumption
df_oc_m['oil_consumption']=df_oc_m['oil_consumption'].fillna(0)
df_oc_m.isnull().sum()
country 0 year 0 oil_consumption 0 dtype: int64
#checking for K and M
df_oc_m.head()
| country | year | oil_consumption | |
|---|---|---|---|
| 0 | Algeria | 1965 | 1.29M |
| 1 | Algeria | 1966 | 1.69M |
| 2 | Algeria | 1967 | 1.58M |
| 3 | Algeria | 1968 | 1.68M |
| 4 | Algeria | 1969 | 1.77M |
df_oc_m.dtypes
country object year object oil_consumption object dtype: object
country colum has an object type we need to covert it to integer
#to do that we run astype()
df_oc_m['year']=df_oc_m['year'].astype(np.int64)
#regex=True is needed if we want the key string to be replaced by value string (e.g. K replaced by "*1e3" in this case)
#1e3 is the scientific notation of 1000.
df_oc_m['oil_consumption']=df_oc_m['oil_consumption'].replace({'k':'*1e3','M':'*1e6'}, regex=True).map(pd.eval).astype(np.int64)
#Checking oil consulmmption table
df_oc_m.head()
| country | year | oil_consumption | |
|---|---|---|---|
| 0 | Algeria | 1965 | 1290000 |
| 1 | Algeria | 1966 | 1690000 |
| 2 | Algeria | 1967 | 1580000 |
| 3 | Algeria | 1968 | 1680000 |
| 4 | Algeria | 1969 | 1770000 |
#Replacing K and M par tyheir numeric values in GDp table
df_gdp_m['gdp_capita']=df_gdp_m['gdp_capita'].replace({'k':'*1e3'}, regex=True).map(pd.eval).astype(np.int64)
#We check the transformatoin by getting the vlues for example"South Africa"
df_gdp_m.loc[df_gdp_m['country']=='South Africa']
| country | year | gdp_capita | |
|---|---|---|---|
| 3140 | South Africa | 2000 | 11000 |
| 3141 | South Africa | 2001 | 11200 |
| 3142 | South Africa | 2002 | 11400 |
| 3143 | South Africa | 2003 | 11600 |
| 3144 | South Africa | 2004 | 12000 |
| 3145 | South Africa | 2005 | 12500 |
| 3146 | South Africa | 2006 | 13000 |
| 3147 | South Africa | 2007 | 13600 |
| 3148 | South Africa | 2008 | 13800 |
| 3149 | South Africa | 2009 | 13400 |
| 3150 | South Africa | 2010 | 13600 |
| 3151 | South Africa | 2011 | 13800 |
| 3152 | South Africa | 2012 | 13900 |
| 3153 | South Africa | 2013 | 14100 |
| 3154 | South Africa | 2014 | 14000 |
| 3155 | South Africa | 2015 | 14000 |
| 3156 | South Africa | 2016 | 13900 |
| 3157 | South Africa | 2017 | 13900 |
| 3158 | South Africa | 2018 | 13900 |
| 3159 | South Africa | 2019 | 13700 |
#we keep only "country", "region" and "sub_region" from region table
df_r=df_r.loc[:,['country','region','sub_region']]
df_r
| country | region | sub_region | |
|---|---|---|---|
| 0 | Afghanistan | Asia | Southern Asia |
| 1 | Åland Islands | Europe | Northern Europe |
| 2 | Albania | Europe | Southern Europe |
| 3 | Algeria | Africa | Northern Africa |
| 4 | American Samoa | Oceania | Polynesia |
| ... | ... | ... | ... |
| 244 | Wallis and Futuna | Oceania | Polynesia |
| 245 | Western Sahara | Africa | Northern Africa |
| 246 | Yemen | Asia | Western Asia |
| 247 | Zambia | Africa | Sub-Saharan Africa |
| 248 | Zimbabwe | Africa | Sub-Saharan Africa |
249 rows × 3 columns
# we create a function merged-region that add region and had two arguments and use the country to perfomr the join
def merged_region (first_table, second_table):
merged = pd.merge(first_table,second_table,how='inner',on='country').drop_duplicates(keep=False)
return merged
df_op_m=merged_region(df_op_m,df_r)
df_op_m.head()
| country | year | oil_production | region | sub_region | |
|---|---|---|---|---|---|
| 0 | Algeria | 1965 | 26100000 | Africa | Northern Africa |
| 1 | Algeria | 1966 | 33600000 | Africa | Northern Africa |
| 2 | Algeria | 1967 | 38600000 | Africa | Northern Africa |
| 3 | Algeria | 1968 | 42400000 | Africa | Northern Africa |
| 4 | Algeria | 1969 | 44400000 | Africa | Northern Africa |
#we need to convert "year" to 'int'
df_op_m.dtypes
country object year object oil_production int64 region object sub_region object dtype: object
#to do that we run astype()
df_op_m['year']=df_op_m['year'].astype(np.int64)
df_op_m.dtypes
country object year int64 oil_production int64 region object sub_region object dtype: object
# we call merged function to merge oil consumption table and region
df_oc_m=merged_region(df_oc_m,df_r)
df_oc_m.head()
| country | year | oil_consumption | region | sub_region | |
|---|---|---|---|---|---|
| 0 | Algeria | 1965 | 1290000 | Africa | Northern Africa |
| 1 | Algeria | 1966 | 1690000 | Africa | Northern Africa |
| 2 | Algeria | 1967 | 1580000 | Africa | Northern Africa |
| 3 | Algeria | 1968 | 1680000 | Africa | Northern Africa |
| 4 | Algeria | 1969 | 1770000 | Africa | Northern Africa |
df_opc_m.shape, df_occ_m.shape, df_gdp_m.shape
((980, 3), (1580, 3), (3900, 3))
#now we perforl the first join and we drop duplicated columns
df_oil=pd.merge(df_opc_m,df_occ_m,how='inner',on=['country','year']).drop_duplicates(keep=False)
df_oil.head()
| country | year | oil_prod_capita | oil_cons_capita | |
|---|---|---|---|---|
| 0 | Algeria | 2000 | 2.15 | 0.273 |
| 1 | Algeria | 2001 | 2.09 | 0.279 |
| 2 | Algeria | 2002 | 2.23 | 0.306 |
| 3 | Algeria | 2003 | 2.45 | 0.314 |
| 4 | Algeria | 2004 | 2.56 | 0.324 |
#next we add the GDP per capita
df_oil=pd.merge(df_oil,df_gdp_m,how='inner',on=['country','year']).drop_duplicates(keep=False)
#check the join
df_oil.head()
| country | year | oil_prod_capita | oil_cons_capita | gdp_capita | |
|---|---|---|---|---|---|
| 0 | Algeria | 2000 | 2.15 | 0.273 | 8710 |
| 1 | Algeria | 2001 | 2.09 | 0.279 | 8860 |
| 2 | Algeria | 2002 | 2.23 | 0.306 | 9230 |
| 3 | Algeria | 2003 | 2.45 | 0.314 | 9770 |
| 4 | Algeria | 2004 | 2.56 | 0.324 | 10100 |
#we call merged_region function to join region
df_oil=merged_region(df_oil,df_r)
df_oil.head()
| country | year | oil_prod_capita | oil_cons_capita | gdp_capita | region | sub_region | |
|---|---|---|---|---|---|---|---|
| 0 | Algeria | 2000 | 2.15 | 0.273 | 8710 | Africa | Northern Africa |
| 1 | Algeria | 2001 | 2.09 | 0.279 | 8860 | Africa | Northern Africa |
| 2 | Algeria | 2002 | 2.23 | 0.306 | 9230 | Africa | Northern Africa |
| 3 | Algeria | 2003 | 2.45 | 0.314 | 9770 | Africa | Northern Africa |
| 4 | Algeria | 2004 | 2.56 | 0.324 | 10100 | Africa | Northern Africa |
#check types
df_oil.dtypes
country object year object oil_prod_capita float64 oil_cons_capita float64 gdp_capita int64 region object sub_region object dtype: object
#change year to int
df_oil['year'] = df_oil['year'].astype(np.int64)
df_oil.dtypes
country object year int64 oil_prod_capita float64 oil_cons_capita float64 gdp_capita int64 region object sub_region object dtype: object
#first we use groupby on "country" to know with coutry is the top producer of oil since the 60's
df_op_bar=df_op_m.groupby('country').sum().sort_values(by='oil_production',ascending=False)['oil_production']
df_op_bar
country United States 24370000000 Saudi Arabia 21812000000 Russia 16186000000 China 7264500000 Canada 6689800000 Mexico 6507400000 Kuwait 6068620000 Iraq 5720200000 United Arab Emirates 5636999999 Nigeria 4862980000 Norway 4022893000 Libya 3987899997 Indonesia 3270499999 Algeria 3226599999 Brazil 2901189999 Qatar 2052299999 Angola 1935973000 Kazakhstan 1772899999 Egypt 1741929998 Oman 1691219999 Argentina 1593899999 India 1470499998 Colombia 1356219999 Malaysia 1234725397 Australia 1202691000 Ecuador 937777000 Azerbaijan 871649999 Gabon 590359999 Romania 484780000 Yemen 411607999 Trinidad and Tobago 405319999 Denmark 373713199 Peru 341280000 Thailand 328962998 Turkmenistan 300570000 Equatorial Guinea 273636299 Sudan 243536500 Tunisia 221660999 Italy 202680000 Uzbekistan 160970000 Chad 104589999 South Sudan 46300000 Name: oil_production, dtype: int64
# we querry for Saudi Arabia the biggest oil producer and draw it production
df_usa = df_op_m.query('country=="United States"')
df_usa.plot.line(x='year', y='oil_production', figsize=(8,6))
<AxesSubplot:xlabel='year'>
We can see an overall increasion of production until around 1979 and then a crush as price collaspse tha episode is know by "1980s oil glut" and around 2010 an increase of production due to technical innovation and the avent of Shale oil and gas extraction,
#first we use groupby on "country" to know with coutry is the top consumer of oil since the 60's
df_oc_bar=df_oc_m.groupby('country').sum().sort_values(by='oil_consumption',ascending=False)['oil_consumption']
df_oc_bar
country
United States 43522000000
Japan 12217900000
China 11767500000
Germany 6961800000
Russia 5738000000
...
Latvia 75700000
Slovenia 71229999
Estonia 63940000
Iceland 37314000
North Macedonia 29014000
Name: oil_consumption, Length: 70, dtype: int64
# we querry for Japan the biggest oil consumer and draw it production
df_usa_c = df_oc_m.query('country=="United States"')
df_usa_c.plot.line(x='year', y='oil_consumption', figsize=(8,6))
<AxesSubplot:xlabel='year'>
We clearly see the same patten let now see if this is general trend not local to thosee coutries
# we group nby "year and "region to get each region production
df_op_reg=df_op_m.groupby(['year','region']).sum()
df_op_reg
| oil_production | ||
|---|---|---|
| year | region | |
| 1965 | Africa | 105995000 |
| Americas | 530039000 | |
| Asia | 358867800 | |
| Europe | 15080000 | |
| Oceania | 346000 | |
| ... | ... | ... |
| 2019 | Africa | 365719999 |
| Americas | 1381620000 | |
| Asia | 1703320000 | |
| Europe | 659270000 | |
| Oceania | 20600000 |
275 rows × 1 columns
We will use here Plotly to get beautiful Filled Area Plots
#
dfw = df_op_m.groupby(['region','year'])['oil_production'].sum().to_frame()
dfw.reset_index(inplace=True)
fig = px.area(dfw, x="year", y="oil_production", color="region")
fig.show()
# we do the same with oil consumption table
df_oc_reg=df_oc_m.groupby(['year','region']).sum()
df_oc_reg
| oil_consumption | ||
|---|---|---|
| year | region | |
| 1965 | Africa | 14795000 |
| Americas | 670041000 | |
| Asia | 169488599 | |
| Europe | 332889999 | |
| Oceania | 17820000 | |
| ... | ... | ... |
| 2019 | Africa | 95300000 |
| Americas | 1215720000 | |
| Asia | 1742179999 | |
| Europe | 708844000 | |
| Oceania | 57280000 |
275 rows × 1 columns
#we draw the consumption by refion filled area plot
dfco = df_oc_m.groupby(['region','year'])['oil_consumption'].sum().to_frame()
dfco.reset_index(inplace=True)
fig = px.area(dfco, x="year", y="oil_consumption", color="region")
fig.show()
We see the same pattern but with a more gentle slope
#now we use the oil per cpita table and group by "year"
df_oil_mean=df_oil.groupby('year').mean()
#we draw the mean of year for each column
df_oil_mean.plot(figsize = (8,8),subplots = True, title = 'Time series of variables from 1990 to 2019');
We can see a relative decline in oil prodution and consumlption per capita but a rise on the GDP
# we create a table containning the values in year 2000
df_oil_2000 = df_oil.groupby(['country', 'year']).max().query('year == 2000')
df_oil_2000.head()
| oil_prod_capita | oil_cons_capita | gdp_capita | region | sub_region | ||
|---|---|---|---|---|---|---|
| country | year | |||||
| Algeria | 2000 | 2.150 | 0.273 | 8710 | Africa | Northern Africa |
| Argentina | 2000 | 1.120 | 0.551 | 18600 | Americas | Latin America and the Caribbean |
| Australia | 2000 | 1.950 | 2.030 | 38200 | Oceania | Australia and New Zealand |
| Azerbaijan | 2000 | 1.730 | 0.779 | 4000 | Asia | Western Asia |
| Brazil | 2000 | 0.384 | 0.491 | 11600 | Americas | Latin America and the Caribbean |
# And year 2019
df_oil_2019 = df_oil.groupby(['country', 'year']).max().query('year == 2019')
df_oil_2019.head()
| oil_prod_capita | oil_cons_capita | gdp_capita | region | sub_region | ||
|---|---|---|---|---|---|---|
| country | year | |||||
| Algeria | 2019 | 1.490 | 0.467 | 11500 | Africa | Northern Africa |
| Argentina | 2019 | 0.644 | 0.616 | 22100 | Americas | Latin America and the Caribbean |
| Australia | 2019 | 0.818 | 1.950 | 49300 | Oceania | Australia and New Zealand |
| Azerbaijan | 2019 | 3.790 | 0.490 | 14400 | Asia | Western Asia |
| Brazil | 2019 | 0.714 | 0.520 | 14800 | Americas | Latin America and the Caribbean |
#Plot GDP distribution between the year 2000 and 2019
fig, ax = plt.subplots(figsize = (8, 8))
bin_edges = np.arange(0, df_oil['gdp_capita'].max()+5000, 5000)
ax.hist(df_oil_2019['gdp_capita'], bins = bin_edges, alpha = 0.5, label = '2019')
ax.hist(df_oil_2000['gdp_capita'], bins = bin_edges, alpha = 0.5, label = '2000')
ax.set_title('Distribution of GDP per capita')
ax.set_xlabel('GDP per capita in US$')
ax.set_ylabel('Number of countries')
ax.legend(loc = 'upper right')
plt.show()
The distubution of GDP per capita is right skewwed but a tendancy to little more evenly distribtion in 2019 than 2000
# corrolation between columns
df_oil[['gdp_capita', 'oil_prod_capita', 'oil_cons_capita']].corr()
| gdp_capita | oil_prod_capita | oil_cons_capita | |
|---|---|---|---|
| gdp_capita | 1.000000 | 0.730939 | 0.815496 |
| oil_prod_capita | 0.730939 | 1.000000 | 0.807735 |
| oil_cons_capita | 0.815496 | 0.807735 | 1.000000 |
# we create a seaborn.regplot() used to plot data and a linear regression model fit
fig, ax = plt.subplots(figsize = (12, 12))
sns.regplot(data = df_oil_2000, x = 'oil_prod_capita', y = 'gdp_capita', fit_reg = True, scatter_kws = {'alpha':0.6}, label = 2000, ax = ax)
sns.regplot(data = df_oil_2019, x = 'oil_prod_capita', y = 'gdp_capita', fit_reg = True, scatter_kws = {'alpha':0.6}, label = 2019, ax = ax);
ax.set_title('Correlation between oil production and Economy')
ax.set_xlabel('oil production per capita')
ax.set_ylabel('GDP/capita in US$')
ax.legend(loc = 'upper right')
plt.show()
# we do the same for oil consumption and GDP
fig, ax = plt.subplots(figsize = (12, 12))
sns.regplot(data = df_oil_2000, x = 'oil_cons_capita', y = 'gdp_capita', fit_reg = True, scatter_kws = {'alpha':0.6}, label = 2000, ax = ax)
sns.regplot(data = df_oil_2019, x = 'oil_cons_capita', y = 'gdp_capita', fit_reg = True, scatter_kws = {'alpha':0.6}, label = 2019, ax = ax);
ax.set_title('Correlation between oil consumptionn and Economy')
ax.set_xlabel('oil consumption per capita')
ax.set_ylabel('GDP/capita in US$')
ax.legend(loc = 'upper right')
plt.show()
Oil consumption, production and GDP per capita show a strong positive correlation with a correlation coefficient of Pearson R = 0.77 for production and R=0.81 for consumption.
general increase trend in overall oil production and consumption since the the 60's but marked by around 1979 and then a crush as price collaspse tha episode is know by "1980s oil glut".And ezven the consumption grow, the production also follow as new technologies emerged like Shale oil.
strong correlation of education with GDP per capita and between oil production per capita and oil consumption per caoita . We can see clearly that the oil was the main drive of econolmic growth.
The amount of oil production doesn't mention the origin of oil, onshore fiels, offshore condenstae or shale oil, and the data abd thus cannot do furthur investigation
Gas production and consumption was not available as i wouil had offered a much broader picture on fossil ennergy.
Even if there is a clear relationship betwwen GDP per capita and oil productioon and consupmtion per capita, other energies forms like coal, wind solar would have drawn a broader picture and give us an idea of ennergies transition
Beside GDP we may need another metrics to measure the wellbeing and livebility like for example the human developement index